Characterization of Consistent Global Checkpoints in Large-Scale Distributed Systems
Authors
Abstract
Backward error recovery is one of the most widely used schemes for ensuring fault tolerance in distributed systems. When a failure occurs, it restores the distributed computation to an error-free global state from which execution can resume and produce correct behaviour. Checkpointing is one of the techniques used to implement backward error recovery. In large-scale distributed systems, a coordinated approach to taking checkpoints is impractical, while with an uncoordinated approach the probability of a domino effect during recovery may no longer be negligible. In this paper, we present a framework that allows us, first, to define the domino effect formally and, second, to state and prove a theorem that determines whether an arbitrary set of checkpoints is consistent. The theorem is very general, as it considers a semantics that includes missing and orphan messages. This result plays a key role in designing uncoordinated checkpointing algorithms that take as few additional checkpoints as possible while ensuring domino-free recovery.
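To make the terms "orphan" and "missing" concrete, here is a minimal Python sketch that classifies messages relative to a candidate set of checkpoints, one per process. It is not the theorem from the paper. It assumes the common textbook definitions: a message is an orphan if its receipt is recorded but its send is not, and missing if its send is recorded but its receipt is not. The names `Message`, `classify`, and `cut` are hypothetical and introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    # Hypothetical record of one message; indices are positions in each
    # process's local event sequence (0-based).
    sender: str
    send_index: int
    receiver: str
    recv_index: int


def classify(messages, cut):
    """Split messages into orphans and missing ones for a candidate cut.

    `cut` maps each process name to the number of local events that
    precede its checkpoint (assumption: one checkpoint per process).
    """
    orphans, missing = [], []
    for m in messages:
        sent_recorded = m.send_index < cut[m.sender]
        recv_recorded = m.recv_index < cut[m.receiver]
        if recv_recorded and not sent_recorded:
            orphans.append(m)   # received "before" it was sent
        elif sent_recorded and not recv_recorded:
            missing.append(m)   # sent, but the receipt is not recorded
    return orphans, missing


# P sends m1 then receives m2; Q receives m1 then sends m2.
m1 = Message("P", 0, "Q", 0)
m2 = Message("Q", 1, "P", 1)

orphans_a, missing_a = classify([m1, m2], {"P": 2, "Q": 1})
orphans_b, missing_b = classify([m1, m2], {"P": 1, "Q": 0})
```

In the first cut, `m2` is an orphan because P's checkpoint records its receipt while Q's checkpoint precedes its send. In the second cut, `m1` is missing. Whether a cut with missing messages counts as consistent depends on the chosen semantics, which is the point of generality the abstract claims for its theorem.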
Similar resources
Necessary and sufficient conditions for transaction-consistent global checkpoints in a distributed database system
Checkpointing and rollback recovery are well-known techniques for handling failures in distributed systems. The issues related to the design and implementation of efficient checkpointing and recovery techniques for distributed systems have been thoroughly understood. For example, the necessary and sufficient conditions for a set of checkpoints to be part of a consistent global checkpoint has be...
Independent global snapshots in large distributed systems
Distributed systems depend on consistent global snapshots for process recovery and garbage collection activity. We provide exact conditions for an arbitrary checkpoint based on independent dependency tracking within clusters of nodes. The method permits nodes (within clusters) to independently compute dependency information based on available (local) information. The existing models of...
Transaction-Consistent Global Checkpoints in a Distributed Database System
Checkpointing and rollback recovery are well-known techniques for handling failures in distributed database systems. In this paper, we establish the necessary and sufficient conditions for the checkpoints on a set of data items to be part of a transaction-consistent global checkpoint of the distributed database. This can throw light on designing efficient, non-intrusive checkpointing techniques...
Distributed multi-agent Load Frequency Control for a Large-scale Power System Optimized by Grey Wolf Optimizer
This paper aims to design an optimal distributed multi-agent controller for load frequency control and optimal power flow purposes. The controller parameters are optimized using the Grey Wolf Optimization (GWO) algorithm. The designed optimal distributed controller is employed for load frequency control in the IEEE 30-bus test system with six generators. The controller of each generator is consider...
Finding Consistent Global Checkpoints in a Distributed Computation
Finding consistent global checkpoints of a distributed computation is important for analyzing, testing, or verifying properties of these computations. In this paper we present a theoretical foundation for finding consistent global checkpoints. Given an arbitrary set S of local checkpoints, we prove exactly which sets of other local checkpoints can be combined with S to build consistent global che...